Class network routing

ABSTRACT

Class network routing is implemented in a network such as a computer network comprising a plurality of parallel compute processors at nodes thereof. Class network routing allows a compute processor to broadcast a message to a range (one or more) of other compute processors in the computer network, such as processors in a column or a row. Normally this type of operation requires a separate message to be sent to each processor. With class network routing pursuant to the invention, a single message is sufficient, which generally reduces the total number of messages in the network as well as the latency to do a broadcast. Class network routing is also applied to dense matrix inversion algorithms on distributed memory parallel supercomputers with hardware class function (multicast) capability. This is achieved by exploiting the fact that the communication patterns of dense matrix inversion can be served by hardware class functions, which results in faster execution times.

CROSS-REFERENCE

The present invention claims the benefit of commonly-owned, co-pending U.S. Provisional Patent Application Ser. No. 60/271,124 filed Feb. 24, 2001 entitled MASSIVELY PARALLEL SUPERCOMPUTER, the whole contents and disclosure of which is expressly incorporated by reference herein as if fully set forth herein. This patent application is additionally related to the following commonly-owned, co-pending United States Patent Applications filed on even date herewith, the entire contents and disclosure of each of which is expressly incorporated by reference herein as if fully set forth herein. U.S. patent application Ser. No. 10/468,999, for "Class Networking Routing"; U.S. patent application Ser. No. 10/469,000, for "A Global Tree Network for Computing Structures"; U.S. patent application Ser. No. 10/468,997, for "Global Interrupt and Barrier Networks"; U.S. patent application Ser. No. 11/868,223, for "Optimized Scalable Network Switch"; U.S. patent application Ser. No. 10/468,991, for "Arithmetic Functions in Torus and Tree Networks"; U.S. patent application Ser. No. 10/468,992, for "Data Capture Technique for High Speed Signaling"; U.S. patent application Ser. No. 10/468,995, for "Managing Coherence Via Put/Get Windows"; U.S. patent application Ser. No. 12/196,796, for "Low Latency Memory Access And Synchronization"; U.S. patent application Ser. No. 10/468,990, for "Twin-Tailed Fail-Over for Fileservers Maintaining Full Performance in the Presence of Failure"; U.S. patent application Ser. No. 10/468,996, for "Fault Isolation Through No-Overhead Link Level Checksums"; U.S. patent application Ser. No. 10/469,003, for "Ethernet Addressing Via Physical Location for Massively Parallel Systems"; U.S. patent application Ser. No. 10/469,002, for "Fault Tolerance in a Supercomputer Through Dynamic Repartitioning"; U.S. patent application Ser. No. 10/258,515, for "Checkpointing Filesystem"; U.S. patent application Ser. No. 10/468,998, for "Efficient Implementation of Multidimensional Fast Fourier Transform on a Distributed-Memory Parallel Multi-Node Computer"; U.S. patent application Ser. No. 10/468,993, for "A Novel Massively Parallel Supercomputer"; and U.S. patent application Ser. No. 10/083,270, for "Smart Fan Modules and System".

This invention was made with Government support under subcontract number B517552 under prime contract number W-7405-ENG-48 awarded by the Department of Energy. The Government has certain rights in this invention.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to class network routing, and more particularly pertains to class network routing which implements class routing in a network such as a computer network comprising a plurality of parallel compute processors at nodes thereof, and which allows a compute processor to broadcast a message to one or more other compute processors in the computer network, such as processors in a column or a row. Normally this type of operation requires a separate message to be sent to each processor. With class network routing pursuant to the invention, a single message is sufficient, which generally reduces the total number of messages in the network as well as the latency to do a multicast.

The present invention relates to the field of message-passing data networks, for example, a network as used in a distributed-memory message-passing parallel computer, as applied for example to computation in the field of life sciences.

The present invention also uses the class function on a torus computer network to do dense matrix calculations. By using the hardware-implemented class function on the torus computer network it is possible to do high performance dense matrix calculations.

The present invention also relates to the field of distributed-memory, message-passing parallel computer design and system software, as applied for example to computation in the field of life sciences. More specifically it relates to the field of high performance linear algebra software for distributed memory parallel supercomputers.

2. Discussion of the Prior Art

A large class of important computations can be performed by massively parallel computer systems. Such systems consist of many compute nodes, each of which typically consists of one or more CPUs, memory, and one or more network interfaces to connect it with other nodes.

The computer described in related U.S. provisional application Ser. No. 60/271,124, filed Feb. 24, 2001, for A Massively Parallel Supercomputer, leverages system-on-a-chip (SOC) technology to create a scalable, cost-efficient computing system with high throughput. SOC technology has made it feasible to build an entire multiprocessor node on a single chip using libraries of embedded components, including CPU cores with integrated, first-level caches. Such packaging greatly reduces the component count of a node, allowing for the creation of a reliable, large-scale machine.

A message-passing data network serves to pass messages between nodes of a network, each of which can perform local operations independently of other nodes. Nodes can act in concert by passing messages between them over the network. An example of such a network is a distributed-memory parallel computer wherein each of its nodes has one or more processors that operate on local memory. An application using multiple nodes of such a computer coordinates the actions of the multiple nodes by passing messages between them. The words switch and router are used interchangeably throughout this specification.

A message-passing data network consists of switches and links, wherein a link merely passes data between two switches. A switch routes incoming data from a node or link to another node or link. A switch may be connected to an arbitrary number of nodes and links. Depending on their location in the network, a message between two nodes may need to traverse several switches and links.

Prior art networks efficiently support some types of message passing, but not all types. For example, some networks efficiently support unicast message passing to a single receiving node, but not multicast message passing to an arbitrary number of receiving nodes. Efficient support of multicast message passing is required in various situations, such as numerical algorithms executed on a distributed-memory parallel computer, which is a requirement in the applications disclosed herein for dense matrix inversion using class functions.

Many user applications need to invert very large N by N (N×N) dense matrices, where N is greater than several thousand. Dense matrices are matrices that have most of their entries being non-zero. Typically, inversion of such matrices can only be done using large distributed memory parallel supercomputers. Algorithms that perform dense matrix inversions are well known and can be generalized for use in distributed memory parallel supercomputers. In that case a large amount of inter-processor communication is required. This can slow down the application considerably.

SUMMARY OF THE INVENTION

Accordingly, it is a primary object of the present invention to provide class network routing which implements class routing in a network which allows a compute processor to broadcast a message to a range of processors, such as processors in a column or a row. Normally this type of operation requires a separate message to be sent to each processor. With class routing pursuant to the present invention, a single message is sufficient, which generally reduces the total number of messages in the network as well as the latency to do a broadcast. The class network routing enhances a network such that it more efficiently supports some additional types of message passing.

Class routing enhances a network to more efficiently support additional types of message passing. As usual, a message is divided into one or more packets which pass atomically through the network. Class routing adds a class value to each packet. At each switch, the class value is used as an index to one or more tables, whose stored values determine the actions performed by the switch on the packet. An index-based table-lookup is fast and efficient, as required for maximal throughput and minimal latency across a switch.

Class routing can be summarized as an efficient encoding and decoding of information needed by a switch to act on a packet, to enable the network to provide certain types of message passing. The information is encoded in the class value of the packet and in the tables of the switches. The information is decoded by using the class value of a packet as an index to the tables.
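To make the table-lookup concrete, the following minimal Python sketch models a switch whose class table maps a packet's class value to a routing action. It is an illustration only, not the disclosed hardware; the Switch class, the action names, and the table layout are hypothetical.

    # Minimal sketch of index-based class lookup at a switch.
    # The table is a plain list indexed by the packet's class value,
    # so decoding is a single O(1) array access.
    ACTIONS = {0: "unicast", 1: "multidrop"}      # hypothetical encoding

    class Switch:
        def __init__(self, class_table):
            self.class_table = class_table        # indexed by class value

        def action_for(self, packet):
            return ACTIONS[self.class_table[packet["class"]]]

    sw = Switch(class_table=[0, 1])
    print(sw.action_for({"class": 1, "dest": (0, 2)}))   # -> multidrop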

A network without class routing is referred to as a basic network. With class routing, it is an enhanced network. With the appropriate entries in the class tables of all the switches, one or more classes of the enhanced network can provide the message-passing types of the basic network. Moreover, since using the class value of a packet as an index to a table is fast, the message-passing types of the basic network are not appreciably slowed down by the enhancement when compared with the basic network.

Other entries in the class tables can provide message-passing types beyond those of the basic network. For example, the unicast message passing of a basic network can be enhanced by class routing to path-based multidrop message passing for multiphase multicasting.

In the classes described above, the enhanced network provides the message-passing types of the basic network, either unmodified or enhanced. In addition, some classes of the enhanced network could override the basic network. For example, overriding classes can provide multidestination message passing for single-phase multicasting. If class routing provides the only message-passing types, then no underlying basic network is required.

The present invention makes dense matrix inversion algorithms on distributed memory parallel supercomputers with hardware class function capability perform faster. A hardware class function is a particular use of class routing. This is achieved by exploiting the fact that the communication patterns of dense matrix inversion can be served by hardware class functions. This results in faster execution times.

If the parallel supercomputer possesses class function capability at the hardware level, then the particular communication patterns of dense matrix inversion can be exploited by using class functions in order to minimize the communication delay. For example, provisional application Ser. No. 60/271,124 describes a computer with class function capability at the hardware level.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing objects and advantages of the present invention for a class network routing may be more readily understood by one skilled in the art with reference being had to the following detailed description of several embodiments thereof, taken in conjunction with the accompanying drawings wherein like elements are designated by identical reference numerals throughout the several views, and in which:

FIG. 1 illustrates an exemplary distributed-memory parallel supercomputer that includes 9 nodes interconnected via a multidimensional grid utilizing a 2-dimensional 3×3 Torus network according to the present invention;

FIG. 2 illustrates in more detail an exemplary node Q00 of the nine nodes of the distributed-memory parallel supercomputer of FIG. 1;

FIG. 3 illustrates an exemplary single phase multicast from node Q00 to the other 8 nodes of the distributed-memory parallel supercomputer illustrated in FIG. 1; and

FIG. 4 illustrates a 4×4 grid of processors wherein each processor is labeled by its row, column numerals.

DETAILED DESCRIPTION OF THE INVENTION

The distributed-memory parallel supercomputer described in U.S. provisional application Ser. No. 60/271,124 comprises a plurality of nodes. Each of the nodes includes at least one processor, which operates on a local memory. The nodes are interconnected as a multidimensional grid and they communicate via grid links. Without losing generality and in order to make the description of this invention easily understandable to one skilled in the art, the multidimensional node grid will be described as an exemplary 2-dimensional grid or an exemplary 3-dimensional grid. The 3-dimensional grid is implemented by a Torus-based architecture. Notwithstanding the fact that only the 2-dimensional node grids or 3-dimensional node grids are described in the following description, it is contemplated within the scope of the present invention that grids of other dimensions may easily be provided based on the teachings of the present invention. An example of 3 dimensions is the 3-dimensional grid implemented on the Torus-based architecture described in provisional application Ser. No. 60/271,124.

FIG. 1 is an exemplary illustration of a distributed-memory parallel supercomputer that includes 9 nodes interconnected via a multidimensional grid utilizing a 2-dimensional 3×3 Torus network 100. It is noted that the number of nodes is in exemplary fashion limited to 9 nodes for brevity and clarity, and that the number of nodes may significantly vary depending on the particular architectural requirements for the distributed-memory parallel supercomputer. FIG. 1 depicts 9 nodes labeled as Q00-Q22, a pair of which is interconnected by a grid link. In total, the 9-node Torus network 100 is interconnected by 18 grid links, where each node is directly interconnected to four other nodes in the Torus network 100 via a respective grid link. It is noted that unlike a mesh, the exemplary 2-dimensional Torus network 100 includes no edge nodes. For example, node Q00 is interconnected to node Q20 via grid link 102; to node Q02 via grid link 104; to node Q10 via grid link 106; and finally to node Q01 via grid link 108. As another example, node Q11 is interconnected to node Q01 via grid link 110; to node Q10 via grid link 112; to node Q21 via grid link 114; and finally to node Q12 via grid link 116. Other nodes are interconnected in a similar fashion.

Data communicated between nodes is transported on the network in one or more packets. For any given communication, more than one packet is needed if the amount of data exceeds the packet size supported by the network. A packet consists of a packet header followed by the data carried by the packet. The packet header contains information required by the torus network to transport the packet from the source node of the packet to the destination node. In a distributed-memory parallel supercomputer that is implemented by the assignee of the present patent application, each node on the network is identified by a logical address and the packet header includes a destination address so that the packet is automatically routed to a node on the network as identified by a destination.

FIG. 2 is an exemplary illustration of node Q00 of the distributed-memory parallel supercomputer of FIG. 1. The node is similar to that in provisional application Ser. No. 60/271,124. The node contains one processor which operates on local memory. The node contains a router which sends and receives packets on the grid links 102, 104, 106, 108 connecting the node Q00 to its neighboring nodes Q20, Q02, Q10, Q01, respectively, as illustrated in FIG. 1. The node contains a reception buffer. If the router receives a packet destined for the local processor, the packet is placed into the reception buffer, from which the packet can be received by the processor. Depending on the application and the packet, the processor may write the contents of the packet into memory. The node contains an injection buffer which operates in a first-in, first-out (FIFO) manner. If the CPU places a packet into an injection FIFO, once the packet reaches the head of the FIFO, the packet is removed from the FIFO by the router and the router places the packet onto a grid link toward the destination node of the packet.

The routing implemented by the router has several simultaneous characteristics. The characteristics are some of those described in provisional application Ser. No. 60/271,124. The routing is a virtual cut-through routing. Thus if an incoming packet on one of the grid links is not destined for the processor, then the packet is forwarded by the router onto one of the outgoing links. This forwarding is performed by the router without the involvement of the processor. The routing is a shortest-path routing. For example, a packet sent by node Q00 to node Q02 will travel over the grid link 104. Any other path would be longer. For another example, a packet sent by node Q00 to node Q11 will travel over the grid links 106 and 112 or over the grid links 108 and 110. The routing is an adaptive routing. There may be a choice of grid links by which a packet can leave a node. In the previous example, the packet could leave the node Q00 via the grid link 106 or 108. For a packet leaving a node, adaptive routing allows the router to choose the less busy outgoing link for a packet or to choose the outgoing link based on some other criteria. Adaptive routing is not just performed at the source node of a packet; adaptive routing also is performed at each intermediate node that a packet may cut through on the packet's way to the packet's destination node.

Class routing can be used to achieve a wide variety of types of message passing. Some of these types are described in the following examples which describe many details of class routing.

EXAMPLE 1 Path-based Multidrop Message Passing

The network of a distributed-memory parallel computer is an example of a message-passing data network. Each node of such a computer has one or more processors that operate on their local memory. An application using multiple nodes of such a computer coordinates their actions by passing messages between them. An example of such a computer is described in provisional application Ser. No. 60/271,124 for A Massively Parallel Supercomputer. In that computer, each single node is paired with a single switch of the network. In that computer, the switches are connected to each other as a three dimensional (3D) torus. Thus in that computer, each switch is linked to six other switches. These links are to a switch in the positive direction and to a switch in the negative direction in each of the three dimensions. Each switch is identified by its (x, y, z) logical address on the 3-dimensional torus. By contrast, in a computer using a 2-dimensional torus, each switch is identified by its (x, y) logical address. In FIG. 1, the positive X direction is towards the right, and the positive Y direction is towards the bottom. In FIG. 1, node Q00 has the logical address (0,0), node Q01 has logical address (0,1) and so on. Since each node is paired with a single switch, a node has the address of its switch. By including a field for such a logical address in the packet header, the packet can efficiently and conveniently identify its destination node. Without class routing, the basic network only provides unicast message passing. If a switch is the destination of an incoming packet, then the packet is given to the local node. Otherwise, the packet is put onto a link towards the destination node.

The following is an example using class routing to implement multidrop message passing. Each packet header has a field for a class value. This value is either 0 or 1. Each switch has a table used to determine if, in addition to the usual unicast routing of the packet, a copy should be deposited at the local node. This assumes for the original unicast message passing that the processor is not involved when the router forwards a packet from one of the incoming links to one of the outgoing links. This assumption is satisfied by virtual cut-through routing, as implemented for example in provisional application Ser. No. 60/271,124. For the class values [0,1], the entries in this deposit table are [0,1] and demand that the packet is not deposited or deposited, respectively. The table is illustrated below. The table only applies for a packet at a node other than its destination node. A packet at its destination node is deposited as in the usual unicast routing. Thus packets with class value 0 obey the original unicast message passing. Packets with class value 1 perform path-based multidrop message passing.

For a packet NOT destined for this node:

    class value    deposit value
    0              0
    1              1
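A short Python sketch of this deposit rule may help; it is our reading of the table semantics, not disclosed code, and the handle function and packet fields are hypothetical:

    # Deposit rule of Example 1: at a node that is NOT the packet's
    # destination, the class-indexed deposit table decides whether a
    # copy is also handed to the local processor as the packet cuts
    # through; the destination node always receives its copy.
    deposit_table = [0, 1]      # class 0: unicast only; class 1: multidrop

    def handle(packet, node):
        actions = []
        if node == packet["dest"]:
            actions.append("deposit")            # usual unicast delivery
        else:
            if deposit_table[packet["class"]]:
                actions.append("deposit")        # multidrop copy en route
            actions.append("forward")            # continue toward dest
        return actions

    print(handle({"class": 1, "dest": (0, 2)}, (0, 1)))  # ['deposit', 'forward']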

Path-based multidrop message passing can be used to implement multiphase multicasting, as described for example in D. K. Panda, S. Singal and P. Prabhakaran, "Multidestination Message Passing Mechanism Conforming to Base Wormhole Routing Scheme", PCRCW '94, LNCS 853, Springer-Verlag, pp. 131-145, 1994. The first example described here is a two phase multicast from node (0,0) to the 9 nodes of the 3*3 torus illustrated in FIG. 1. In the first phase, node (0,0) sends a multidrop message with destination (0,2). In the second phase, each of the 3 recipients of the first phase simultaneously sends a multidrop message. Node (0,0) sends to (2,0); node (0,1) to (2,1); and node (0,2) to (2,2). At the end of the second phase, all 9 nodes of the 2-dimensional torus have received the broadcast message.

The above assumes that in the original unicast message passing, when the source node and destination node are in the same row, then the path of the packet is along that row. A row is a group of nodes which have equal values for all but one of the dimensions of the torus or mesh. The assumption is guaranteed by shortest-path routing, as implemented for example in provisional application Ser. No. 60/271,124. The above assumption also is guaranteed by the deterministic routing implemented in the provisional application. By contrast, the above assumption is not satisfied by the congestion avoidance routing implemented elsewhere, which routes a packet via some random node.

The second example described here is a three phase multicast from node (0,0,0) to the 125 nodes of the 5*5*5 cube with the corners (0,0,0) and (4,4,4). In the first phase, node (0,0,0) sends a multidrop message with destination (0,0,4). In the second phase, each of the 5 recipients of the first phase simultaneously sends a multidrop message. Node (0,0,0) sends to (0,4,0); node (0,0,1) to (0,4,1); and so on. In the third phase, each of the 25 recipients of the second phase simultaneously sends a multidrop message. Node (0,0,0) sends to (4,0,0); node (0,0,1) to (4,0,1); and so on. At the end of the third phase, all 125 nodes of the cube have received the broadcast message. The above example of a 3-phase multicast for the 3-dimensional cube is easily generalized as follows. For a D-phase multicast from an origin node to all nodes of a D-dimensional cube: in a first phase the origin node sends a multidrop message to all other nodes in one of the rows of the sending node; in a second phase each of the recipients of the first phase and the sender of the first phase simultaneously send a multidrop message to all other nodes in a row orthogonal to the row of the first phase; in a third phase each of the recipients of the second phase and the senders of the second phase simultaneously send a multidrop message to all other nodes in a row orthogonal to the rows of the first and second phases; and so on in further phases, such that all nodes of the cube receive the broadcast message after all the phases.
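The phase structure generalizes mechanically; the following Python sketch (ours, with hypothetical function names) generates the multidrop schedule for an n**D cube and checks the coverage and injection counts of the examples above:

    # D-phase multidrop schedule on an n**D cube rooted at (0, ..., 0).
    # In phase d, every node reached so far injects one multidrop
    # message along dimension d, depositing at each node of its row.
    def multiphase_multicast(n, D):
        reached = {(0,) * D}
        injections = 0
        for d in range(D):
            for node in sorted(reached):      # senders of this phase
                injections += 1               # one multidrop message
                for v in range(n):            # drops along the row
                    dst = list(node)
                    dst[d] = v
                    reached.add(tuple(dst))
        return reached, injections

    reached, inj = multiphase_multicast(5, 3)
    assert len(reached) == 5 ** 3 and inj == 1 + 5 + 25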

The implementation of path-based multidrop message passing using class routing offers advantages beyond existing implementations. For example, a particular existing implementation places the deposit value into the packet. In that implementation, every node on the path of the packet receives a copy of the packet. In contrast, since each switch can have different entries in its deposit table, class routing allows a node with the deposit entries [0,0] to not receive a copy of a packet, even though the node is on the path of the multidrop packet. The table is illustrated below. For example, with several class values for multicasting, this allows for several multicast groups, each with a different set of nodes.

For a packet NOT destined for this node:

    class value    deposit value
    0              0
    1              0

EXAMPLE 2 Sending Multidrop Packets without Knowing the Recipients

As described in Example 1, class routing allows a node with the deposit entries [0,0] for class values [0,1] to not receive a copy of a packet, even though the node is on the path of the multidrop packet. This information need not be known by the source node of the multidrop packet. In other words, class routing allows a node to source a multidrop packet without knowing the recipients. However, in the network of Example 1 there is one exception: the destination node of the multidrop packet always will receive a copy of the packet. Thus if the destination node is to not receive a copy of the packet, this must be known by the source node such that it can use another destination.

For example, assume node (0,0) is the source of a multidrop packet originally destined for node (0,2). This may be a natural destination on a torus network of size 3*3, since nodes (0,0) through (0,2) are a complete row. If node (0,2) is to not receive a copy, then this must be known by node (0,0). If node (0,0) also knows that node (0,1) is to receive a copy, then (0,1) can be used as the destination of the multidrop packet.

In order to solve the exception caused by the destination node, class routing allows each switch to have an additional table which determines if a copy of a packet should be deposited at the destination node. To solve the above example, for node (0,2) the entries in this destination table are [1,0] for the class values [0,1]. The entry 0 for class 1 causes node (0,2) to not receive multidrop messages, even if it is the destination. The entry 1 for class 0 allows node (0,2) to receive unicast messages as usual. The two tables are illustrated below.

For a packet destined for this node (0,2):

    class value    deposit value
    0              1
    1              0

For a packet NOT destined for this node (0,2):

    class value    deposit value
    0              0
    1              0

In the above example, node (0,2) is not a participant in the multicast with class value 1.

As a contrasting example, node (0,1) is a participant in the multicast with class value 1. The corresponding tables for node (0,1) are illustrated below.

For a packet destined for this node (0,1):

    class value    deposit value
    0              1
    1              1

For a packet NOT destined for this node (0,1):

    class value    deposit value
    0              0
    1              1
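Read together, the two tables amount to a two-key lookup. A small Python sketch (our naming; the table contents are copied from the two nodes above) shows how node (0,2) opts out of class 1 while node (0,1) participates:

    # Two-table deposit rule of Example 2: the "destined" table applies
    # when the node is the packet's destination, the "not_destined"
    # table when it is an intermediate node; entries are indexed by
    # class value.
    tables = {
        (0, 2): {"destined": [1, 0], "not_destined": [0, 0]},  # opts out
        (0, 1): {"destined": [1, 1], "not_destined": [0, 1]},  # participates
    }

    def deposits_copy(node, packet):
        key = "destined" if node == packet["dest"] else "not_destined"
        return bool(tables[node][key][packet["class"]])

    print(deposits_copy((0, 2), {"class": 1, "dest": (0, 2)}))   # False
    print(deposits_copy((0, 1), {"class": 1, "dest": (0, 2)}))   # True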

EXAMPLE 3 Snooping

Assume the network described above in Example 1, including its use of the class value 0 for the unicast messages of the basic network. A node can snoop, and acquire and store information on the unicast packets passing through its switch, by using the entry 1 for class value 0 in the deposit table.

The table is illustrated below. In the example, the node is a participant in the multicast with class value 1. The table only applies for a packet at a node other than its destination node. In this example, a packet at its destination node is deposited as in the usual unicast routing.

For a packet NOT destined for this node:

    class value    deposit value
    0              1
    1              1

An example use of such snooping is the investigation of the performance of the network. Without snooping there may only be information on when the packet entered the network at the source node and when it exited at the destination node. With snooping, there can be information on when the packet passed through a node on the path of the packet. Since there may be multiple valid paths between a pair of nodes, snooping also can provide information on which particular path was used. An example of a routing with multiple valid paths between a pair of nodes is adaptive routing, as implemented for example in provisional application Ser. No. 60/271,124.

Since each switch can have different entries in its deposit table, class routing allows an arbitrary number of nodes to be snooping. If only a small fraction of nodes in the network are snooping, then the measurements are a statistical sampling.

Snooping is an example use of class routing not specifically related to multicasting.

EXAMPLE 4 Single Phase Multicast

In a single phase multicast, the message is injected once into the network by one of the nodes. In contrast, in a multiphase multicast, the message is injected several times into the network, perhaps by multiple nodes. For example, in the multiphase multicast on the 3*3 node torus described above in Example 1, the message is injected a total of 1+3=4 times by 3 different nodes. For example, in the multiphase multicast on the 5*5*5 node torus described above in Example 1, the message is injected a total of 1+5+25=31 times by 25 different nodes.
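The injection counts quoted above follow a geometric series, one term per phase; a short Python check (ours) makes the pattern explicit:

    # A D-phase multicast on an n**D cube injects 1 + n + ... + n**(D-1)
    # messages in total, since each phase multiplies the senders by n.
    def injections(n, D):
        return sum(n ** d for d in range(D))

    assert injections(3, 2) == 4      # 3*3 torus: 1 + 3
    assert injections(5, 3) == 31     # 5*5*5 cube: 1 + 5 + 25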

As is well known, to provide single phase multicast, a switch must be able to duplicate an incoming packet onto multiple outgoing links. In essence, the message duplication performed by a node in multiphase multicasting is performed by a switch in single phase multicasting.

The advantage offered by class routing for single phase multicasting is an efficient encoding and decoding of which of the outgoing switches do or do not receive a copy of a particular incoming packet. After a simple example describing the encoding and decoding scheme offered by class routing, the scheme is compared to existing schemes.

The first example described here is the same multicast described in Example 1 from node (0,0) to the 9 nodes of the 3*3 torus illustrated in FIG. 1. In Example 1 it is a two phase multicast; here it is a single phase multicast. Here the pattern of messages across the network is chosen to be similar to that of Example 1.

Each packet header has a field for a class value. This value is either 0 or 1. Each switch has a table used to determine if the usual unicast routing of the packet is to be performed or if the actions of single phase multicast routing are to be performed. Each entry in the table is a bit string of the format UDXY. If in a table entry U is 1, then the usual unicast routing is to be performed, otherwise not. If D is 1, then a copy of the packet is to be deposited at the local node, otherwise not. If X is 1, then a copy of the packet is to go out the positive X link, otherwise not. If Y is 1, then a copy of the packet is to go out the positive Y link, otherwise not. The two links in the negative X and Y directions are irrelevant to the example and are ignored here for simplicity.

For class value 0, the entry in the table is 1000 on all nodes. Thus packets with class value 0 obey the original unicast message passing. For class value 1, the entry in the table depends on the location of the switch in the network. The entry at each switch mimics the actions of the corresponding node in the multiphase multicast of Example 1.

At each node, the table is obeyed for all packets entering the node. If a packet has class value 0, then the UDXY=1000 identifies the packet as a unicast packet and only then is the destination of the packet examined.

For class value 1, switch (0,0) has the entry 0011. This assumes that the source node of the multicast does not need another copy. The table for node (0,0) is illustrated below.

For a packet at node (0,0):

    class value    UDXY value
    0              1000
    1              0011

Continuing with class value 1 for the other switches in the 3*3 torus, the switch (0,1) has the entry 0111. The four switches (0,2), (1,0), (1,1), and (1,2) have the entry 0101. The three switches (2,0), (2,1) and (2,2) have the entry 0100. The above is a complete encoding of the information required for the example multicast using class 1. In short, packets with class value 0 obey the original unicast message passing. Packets originating from node (0,0) with class value 1 perform single phase multicast routing.
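These entries can be checked mechanically. The Python sketch below (ours) walks one class-1 packet through the UDXY entries just listed, treating addresses as (row, column) with the X bit moving along a row and the Y bit down a column as in FIG. 1, and confirms that exactly the other 8 nodes receive a deposit:

    # Follow the class-1 duplication bits from source (0,0) on the 3*3
    # torus. Entry format is the UDXY bit string described above.
    UDXY = {(0, 0): "0011", (0, 1): "0111",
            (0, 2): "0101", (1, 0): "0101", (1, 1): "0101", (1, 2): "0101",
            (2, 0): "0100", (2, 1): "0100", (2, 2): "0100"}

    deposited, frontier = [], [(0, 0)]
    while frontier:
        a, b = frontier.pop()
        u, d, x, y = UDXY[(a, b)]
        if d == "1":
            deposited.append((a, b))       # copy to the local node
        if x == "1":
            frontier.append((a, b + 1))    # copy out the positive X link
        if y == "1":
            frontier.append((a + 1, b))    # copy out the positive Y link

    all_nodes = {(a, b) for a in range(3) for b in range(3)}
    assert sorted(deposited) == sorted(all_nodes - {(0, 0)})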

The above UDXY values at each node for multicast from node (0,0) using class 1 are illustrated in FIG. 3. At each node, the circle is open if D=0, that is, if no copy of the packet is to be deposited at the node. At each node, the circle is closed if D=1, that is, if a copy of the packet is to be deposited at the node. At each node, there is an arrow in the positive X direction if X=1, that is, if a copy of the packet is to go out the positive X link. At each node, there is an arrow in the positive Y direction if Y=1, that is, if a copy of the packet is to go out the positive Y link.

The second example described here is the same multicast described in Example 1 from node (0,0,0) to the 125 nodes of the 5*5*5 cube with the corners (0,0,0) and (4,4,4). In Example 1 it is a three phase multicast; here it is a single phase multicast. Here the pattern of messages across the network is chosen to be similar to that of Example 1.

Each packet header has a field for a class value. This value is either 0 or 1. Each switch has a table used to determine if the usual unicast routing of the packet is to be performed or if the actions of single phase multicast routing are to be performed. Each entry in the table is a bit string of the format UDXYZ. If in a table entry U is 1, then the usual unicast routing is to be performed, otherwise not. If D is 1, then a copy of the packet is to be deposited at the local node, otherwise not. If X is 1, then a copy of the packet is to go out the positive X link, otherwise not. Similarly for the bits Y and Z. The three links in the negative X, Y and Z directions are irrelevant to the example and are ignored here for simplicity.

For class value 0, the entry in the table is 10000 on all nodes. Thus packets with class value 0 obey the original unicast message passing. For class value 1, the entry in the table depends on the location of the switch in the network. The entry at each switch mimics the actions of the corresponding node in the multiphase multicast of Example 1.

For class value 1, switch (0,0,0) has the entry 00111. This assumes that the source node of the multicast does not need another copy. The three switches (0,0,1) through (0,0,3) have the entry 01111. Switch (0,0,4) has the entry 01110. The fifteen switches in the x=0 plane with the corners (0,1,0), (0,1,4), (0,3,0) and (0,3,4) have the entry 01110. The five switches (0,4,0) through (0,4,4) have the entry 01100. The 75 switches of the cube with the corners (1,0,0), (1,0,4), (3,0,0) and (3,0,4) have the entry 01100. The 25 switches in the x=4 plane with the corners (4,0,0), (4,0,4), (4,4,0) and (4,4,4) have the entry 01000. The above is a complete encoding of the information required for the example multicast using class 1. In short, packets with class value 0 obey the original unicast message passing. Packets originating from node (0,0,0) with class value 1 perform single phase multicast routing.

In the above example of class routing for single phase multicasting, the UDXYZ bit string determines onto which output ports a packet is to be duplicated. A similar bit string is used in some existing implementations of single phase multicasting. An example is described in R. Sivaram, R. Kesavan, D. K. Panda and C. B. Stunkel, "Architectural Support for Efficient Multicasting in Irregular Networks", IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 5, May 2001. Another example is described in U.S. Pat. No. 5,333,279, Self-timed mesh routing chip with data broadcasting, D. Dunning. In these existing implementations, a bit string similar to the above UDXYZ for each switch is in the packet header. In contrast, in the above class routing implementation, the packet header merely contains the class value which is used at each switch to look up in a table the UDXYZ entry.

The above class routing implementation of single-phase multicasting is in some ways less general than these existing implementations, but the class routing is in some ways more efficient. For example, in the packet header, a field for a class value is much smaller than a field for a bit string for each switch. In the above example, the class value is 0 or 1 and thus can be stored in a one-bit field in the header. In contrast, the above UDXYZ bit string would require a five-bit field in the header. Moreover, several fields for UDXYZ values would be required, since different switches have different values for UDXYZ. The smaller field in the header is more efficient since it consumes less of the physical bandwidth of the torus network, leaving more bandwidth for the application data. The smaller field also allows for a smaller latency, since typically at a switch, the entire header must be received and checked for errors before the packet can be forwarded.

EXAMPLE 5 Single Phase Multicast from Any Node in the Network

The single phase multicast using class routing described in Example 4 allows a single node to be the source of the message. In the example on the 2-dimensional 3*3 torus, the source is the node (0,0). In the example on the 3-dimensional 5*5*5 torus, the source is the node (0,0,0). We'll name this a heterogeneous single phase multicast, since the class routing table has different values at different nodes. Only one table is used for all of the input links.

Class routing also can be used to implement a single phase multicast where the source can be any node in the network. We'll name this a homogeneous single phase multicast, since on a homogeneous network such as a torus the class routing tables have the same values on every node. On a single node, the class routing tables have different values on the different incoming links.

The first example described here is the same multicast described in Example 4 from node (0,0) to the 9 nodes of the 3*3 torus illustrated in FIG. 1. In Example 4 it is a heterogeneous single phase multicast; here it is a homogeneous single phase multicast. Here the pattern of messages across the network is chosen to be similar to that of Example 4.

In the heterogeneous single phase multicast of Example 4, a packet arriving at a node via any of the incoming links uses the same table to determine the actions to be performed by the switch on the packet based on the class value. As demonstrated in Example 4, for the heterogeneous multicast, different nodes have different values in the table. By contrast, in the homogeneous single phase multicast of this example, each incoming link on each switch has a table used to determine the actions to be performed on an incoming packet. As demonstrated below, for the homogeneous multicast, different nodes have the same values in the tables.

Each packet header has a field for a class value. This value is either 0 or 1. Each incoming link on each switch has a table used to determine if the usual unicast routing of the packet is to be performed or if the actions of single phase multicast routing are to be performed. Each entry in the table is a bit string of the format UDXY. If in a table entry U is 1, then the usual unicast routing is to be performed, otherwise not. If D is 1, then a copy of the packet is to be deposited at the local node, otherwise not. If X is 1 and the X-destination of the packet is not the X-location of the node, then a copy of the packet is to go out the positive X link, otherwise not. If Y is 1 and the Y-destination of the packet is not the Y-location of the node, then a copy of the packet is to go out the positive Y link, otherwise not. For each node, the two outgoing links in the negative X and Y directions are irrelevant to the example and are ignored here for simplicity. For each node, the two incoming links in the negative X and Y directions are irrelevant to the example and are ignored here for simplicity.

As described above, the X-destination and the Y-destination of the packet are determined in order to determine the actions performed on the packet. Thus for node (0,0) to broadcast to all other 8 nodes of the 3*3 torus, the packet must have the destination (2,2).

In general for a broadcast in this example, the destination of the packet is the furthest node in the positive X and positive Y direction from the source of the broadcast. For example, for node (1,0) to broadcast to all other 8 nodes of the 3*3 torus, the packet must have the destination (0,2).

For class value 0, the entry in the table is 1000 in all tables on all nodes. Thus packets with class value 0 obey the original unicast message passing. For class value 1, the entry in the table depends on which incoming link the packet arrived on. The tables are illustrated below. The entries for each incoming link are such that the resulting homogeneous multicast mimics the heterogeneous multicast of Example 4.

For a packet incoming on the link from the negative x direction:

    class value    UDXY value
    0              1000
    1              0111

For a packet incoming on the link from the negative y direction:

    class value    UDXY value
    0              1000
    1              0101

The above is a complete encoding of the information required for the example multicast using class 1. In short, packets with class value 0 obey the original unicast message passing. Packets with class value 1 perform homogeneous single phase multicast routing.
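As a check of these tables, here is a Python sketch (ours) of a class-1 broadcast from several different sources on the 3*3 torus. Two points in it are assumptions rather than disclosed behavior: an injected packet is treated as if it arrived from the negative x direction, and the source discards the deposit of its own packet; the negative-y entry 0101 is the value consistent with the deposit rule above.

    # Homogeneous single-phase multicast on an n x n torus. Addresses
    # are (row, col); the X bit moves along a row, the Y bit down a
    # column, with wraparound; duplication in a direction stops at the
    # packet's destination coordinate, the furthest node in +X and +Y.
    N = 3
    TABLE = {"neg_x": "0111", "neg_y": "0101"}    # class-1 entries

    def multicast(src):
        dest = ((src[0] - 1) % N, (src[1] - 1) % N)   # furthest node
        deposited, frontier = [], [(src, "neg_x")]    # assumed injection
        while frontier:
            (a, b), came_from = frontier.pop()
            u, d, x, y = TABLE[came_from]
            if d == "1" and (a, b) != src:            # source keeps no copy
                deposited.append((a, b))
            if x == "1" and b != dest[1]:
                frontier.append(((a, (b + 1) % N), "neg_x"))
            if y == "1" and a != dest[0]:
                frontier.append((((a + 1) % N, b), "neg_y"))
        return deposited

    for src in [(0, 0), (1, 0), (2, 2)]:
        got = multicast(src)
        assert len(got) == N * N - 1 and len(set(got)) == N * N - 1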

Given the above 2-dimensional torus example, the technique is easily extended to other networks. Class 1 in the above example can be considered to provide multicasting in the positive X and positive Y quadrant of a mesh. Three additional similar classes 2, 3 and 4 could provide multicasting in the other three quadrants: negative X and positive Y; positive X and negative Y; as well as negative X and negative Y. These four classes allow any node in the mesh to use four multicasts to effectively broadcast a packet to all other nodes in the mesh. Using the same broadcast technique on the torus would be twice as fast as the single class technique described above. It is twice as fast since the distance between the source node and the destination nodes is halved. This technique is feasible since any node on a torus can be treated as a node in the middle of a mesh.

The above technique is easily generalized to a mesh or torus of D dimensions. On a D-dimensional mesh or torus, 2*D classes allow any node in the mesh or torus to use 2*D multicasts to effectively broadcast a packet to all other nodes in the mesh or torus. On the torus, the alternative single broadcast to all the nodes will require twice as long to complete as the 2*D multicasts on the torus, since the distance between the source node and the furthest destination is double for the single broadcast.

Enhancements and Alternatives to Class Tables

Instead of or in addition to using tables on the switch, the class value and perhaps other characteristics of the packet can be input to an algorithm. If table entries are the same for all class values, then it might be better to use an algorithm. An algorithm also can be used if a switch needs to decide between conflicting actions demanded by tables, since the algorithm can be programmed with the relative priorities of the different tables.

Using Class-based Multicasting to Create other Classes

In Example 5, class value 0 is used for the usual unicast, while class value 1 can be used to broadcast to all nodes in the torus. Having established a broadcast mechanism, it can be used to broadcast any data. For example, this data could be the class table entries for other classes. For example, Example 5 identified a need for the additional classes 2, 3 and 4. Once multicasting on class 1 is established by whatever means, class 1 can be used to create classes 2, 3 and 4. In general, once communication on a particular class value or values is established, that communication can be used to establish communication on other class values.

EXAMPLE 7 Dense Matrix Calculation Using Class Function

The present invention also uses the class function on a torus computer network to do dense matrix calculations. By using the hardware-implemented class function on the torus computer network it is possible to do high performance dense matrix calculations.

Class function is the name used in this example for multicasting based on class network routing. Often, the multicast is to other nodes in the same row. So often it is sufficient for class routing to implement a single phase of path-based multidrop message passing, which is described in Example 1. When the multicast is not to a row, it is to a plane, cube or other higher-dimensional subset of the torus or mesh. In this case, optimal performance demands that class routing implement a more sophisticated multicast, such as the single phase multicast described in Example 5.

The present invention makes dense matrix inversion algorithms on distributed memory parallel supercomputers with hardware class function capability perform faster. This is achieved by exploiting the fact that the communication patterns of dense matrix inversion can be served by hardware class functions. This results in faster execution times.

The algorithms as discussed herein are well known in the art, and are discussed, for example, in NUMERICAL RECIPES IN FORTRAN: THE ART OF SCIENTIFIC COMPUTING, Second Edition, by William H. Press, et al., particularly at page 27 et seq.

FIG. 4 illustrates a 4×4 grid of processors wherein each processor is labeled by its row, column numerals. For example, the processor in row 2, column 3 is p(2,3). The column i and row i are also shown (shaded areas), as well as the directions in which the column/row has to be sent via the class function.

One can invert a dense linear matrix using standard algorithms such as Gauss-Jordan elimination as well as other methods. In general the I/O required is of a special one-to-many variety that is well suited to the communication functionality of a parallel supercomputer with hardware class function capability. One can utilize the class functionality to multicast data to an entire row or surface of the machine.

Some of the terms used in the description of this invention are explained below:

The Gauss-Jordan Algorithm:

The kernel of the Gauss-Jordan algorithm without pivoting is given below. Initially b is an identity matrix and a is the matrix whose inverse is being computed.

    do i=1,N
     do j=1,N
      do k=1,N (k not equal to i)
       b(k,j) = b(k,j) - [a(k,i) / a(i,i)] * b(i,j)
       a(k,j) = a(k,j) - [a(k,i) / a(i,i)] * a(i,j)
      enddo
     enddo
    enddo

Equation 1
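The kernel above is schematic: row i must still be scaled by a(i,i) (step 2 of the example algorithm later in this section), and a(k,i) must be read before it is overwritten. A small runnable Python version (ours, serial, no pivoting, 0-based indices) for checking:

    # Serial Gauss-Jordan inverse without pivoting; b starts as the
    # identity and ends as the inverse of a.
    def gauss_jordan_inverse(a):
        n = len(a)
        a = [row[:] for row in a]             # work on a copy
        b = [[float(i == j) for j in range(n)] for i in range(n)]
        for i in range(n):
            piv = a[i][i]
            for j in range(n):                # scale row i by a(i,i)
                a[i][j] /= piv
                b[i][j] /= piv
            for k in range(n):
                if k == i:
                    continue
                f = a[k][i]                   # save before overwriting
                for j in range(n):
                    b[k][j] -= f * b[i][j]
                    a[k][j] -= f * a[i][j]
        return b

    inv = gauss_jordan_inverse([[4.0, 3.0], [6.0, 3.0]])
    assert abs(inv[0][0] + 0.5) < 1e-12       # inverse is [[-0.5, 0.5],
                                              #             [1.0, -2/3]]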

Distributed Memory Parallel Supercomputer:

Such a computer consists of many nodes. Each node has one or more processors that operate on local memory. The nodes are typically connected as a d-dimensional grid and they communicate via the grid links. If the grid is 2-dimensional with P×P processors, then an N×N matrix can be partitioned so that L×L pieces of it reside on each node (L=N/P). If the machine is not connected as a 2-dimensional grid, the problem can always be mapped onto it by appropriately "folding" the matrix onto the grid. Without loss of generality and in order to make the presentation of this invention simple, the processor grid will be assumed to be 2-dimensional.

Hardware Class Functions:

Class functions are a hardware implementation of multicast. Suppose that processor p(1,1) (here the numerals indicate the position of the processor on the grid; also see FIG. 4) wants to send the same packet of data to processors p(1,2), p(1,3) and p(1,4). Typically this is done by first sending the data to processor p(1,2). Once the data arrives into p(1,2), software routines read it and store it in memory. Then p(1,2) reads the data from memory and sends it to p(1,3), etc. The problem with this is that it takes a long time to fully receive the packet of data into memory and then resend it. If the hardware was built so that the packet of data that arrived into p(1,2) was simultaneously stored into the p(1,2) memory and immediately sent to p(1,3), then the delay would be greatly reduced. The hardware function of p(1,1) sending a packet of data to p(1,4), while that packet is deposited into the memory of the intermediate processors that it goes through, is called the hardware class function.
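A back-of-envelope latency model (ours; the time constants are made up for illustration) shows why the cut-through deposit wins: software store-and-forward pays the full packet time at every hop, while the hardware class function pays it once plus a small per-switch header delay.

    # Rough latency model for delivering one packet across H hops.
    # t_packet: time to receive and resend a whole packet at a node;
    # t_hop: per-switch cut-through latency of the leading header.
    def store_and_forward(hops, t_packet):
        return hops * t_packet

    def hardware_class_function(hops, t_packet, t_hop):
        return t_packet + hops * t_hop

    print(store_and_forward(3, 1000.0))              # e.g. 3000 units
    print(hardware_class_function(3, 1000.0, 10.0))  # e.g. 1030 units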

The Invention:

This invention exploits the fact that the communication patterns of dense matrix inversion (for example, using the Gauss-Jordan method) can utilize class functions. This can be seen from Equation 1, which describes the Gauss-Jordan algorithm:

The a(i,i) are communicated via some other method, for example a global broadcast. Then the right hand side of the equations for b(k,j) and a(k,j) involve elements that have only one index different from (k,j) but not both (a(k,i), a(i,j) and b(i,j)). Class function communication can be used to send such elements across the relevant processors.

For example, in order to calculate b(k,j) for a given row k (1≤j≤N) one needs a(k,i) to be known by all processors that contain the row k. Therefore, one must send a(k,i) along the row of processors that contain the matrix row k. This can be done using the class functionality. As already discussed, this results in large reductions in total communication time.

This completes the description of the idea for this invention. The idea was described for the Gauss-Jordan algorithm but it is not specific to it. For example, this idea applies to the "Gauss-Jordan with Pivoting", "Gaussian Elimination with Back Substitution" and "LU Decomposition" algorithms.

An implementation of this idea (using the Gauss-Jordan algorithm) with all the details is presented below as an example. In order to make the example easy to understand, the simplest implementation was chosen. More complex implementations that result in communications involving larger data packets have also been worked out. Depending on the size of the processor grid and the size of the matrix, larger packet sizes may be desirable since they further improve performance by minimizing latency. However, this does not affect the premise of this idea.

An Example Algorithm:

The Gauss-Jordan algorithm is used to find the matrix inverse of a dense matrix of size N×N uniformly spread out on a grid of P×P nodes. Therefore each node has an L×L piece of the matrix in its memory (L=N/P). A hardware class function is used to multicast data across rows and columns. For a visual picture of this algorithm please refer to FIG. 4 above.

For each i (1≤i≤N):

1) Using class functions, send to the left and right the column i of a's (a(k,i), 1≤k≤N).

2) Scale the elements a, b of row i by a(i,i).

3) Using class functions, send up and down the new row i of a's and b's (a(i,j) and b(i,j), 1≤j≤N).

4) Now all processors have the necessary elements to do the standard Gauss-Jordan step for column i. At the end of this step, column i is the same as the corresponding column of the identity matrix.

Repeat. (A serial simulation of this loop is sketched below.)

End of Examples.
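To show how the four steps fit together, here is a small serial Python simulation (ours; names like step1_col_multicast are hypothetical stand-ins for the hardware class functions). It models the simplest case L=1, one matrix element per node of a P×P grid, with 0-based indices; the pivot a(i,i) is read directly, standing in for the separate global broadcast mentioned above.

    # Simulate one node per matrix element on a P x P grid. Each node
    # holds fields "a" and "b"; the two multicast helpers deposit their
    # payload at every node of a row or column, as a class function would.
    def step1_col_multicast(nodes, i, P):
        for k in range(P):                    # node (k,i) sends along row k
            for j in range(P):
                nodes[(k, j)]["a_ki"] = nodes[(k, i)]["a"]

    def step3_row_multicast(nodes, i, P):
        for j in range(P):                    # node (i,j) sends down col j
            for k in range(P):
                nodes[(k, j)]["a_ij"] = nodes[(i, j)]["a"]
                nodes[(k, j)]["b_ij"] = nodes[(i, j)]["b"]

    def gauss_jordan_iteration(nodes, i, P):
        step1_col_multicast(nodes, i, P)      # step 1: column i of a's
        piv = nodes[(i, i)]["a"]              # stands in for a broadcast
        for j in range(P):                    # step 2: scale row i
            nodes[(i, j)]["a"] /= piv
            nodes[(i, j)]["b"] /= piv
        step3_row_multicast(nodes, i, P)      # step 3: scaled row i
        for k in range(P):                    # step 4: local updates
            if k == i:
                continue
            for j in range(P):
                n = nodes[(k, j)]
                n["b"] -= n["a_ki"] * n["b_ij"]
                n["a"] -= n["a_ki"] * n["a_ij"]

    P = 3
    A = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
    nodes = {(r, c): {"a": A[r][c], "b": float(r == c)}
             for r in range(P) for c in range(P)}
    for i in range(P):
        gauss_jordan_iteration(nodes, i, P)
    # nodes[(r, c)]["b"] now holds entry (r, c) of the inverse of A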

While several embodiments and variations of the present invention for class network routing are described in detail herein, it should be apparent that the disclosure and teachings of the present invention will suggest many alternative designs to those skilled in the art.

CLAIMS

1. A method of class network routing in a network to allow a compute processor in a network of compute processors located at nodes of the network to multicast a message to a plurality of other compute processors in the network comprising: dividing a message into one or more message packets which pass through the network, each one or more message packets including a header having a field including a class value; at each switch in the network, using the class value as a vector index to a table having an array of stored values that efficiently encode actions performed by the switch on the message packet, said class value determining a switch action of path-based multidrop message passing for multiphase multicasting of a message packet through the network along a path comprising an intermediate node and a destination node, to determine from said table having an array of stored values at a switch at each said intermediate node or destination node, whether that intermediate node or a destination node should deposit a copy of the message packet, wherein said array of stored values at each switch includes different entries such that said class network routing allows a node to not receive a copy of a packet, even though the node is on a path of the multidrop message packet, wherein said class network routing further comprises using a D-phase multicast from an origin node to all nodes of a D-dimensional cube wherein, in a first phase the origin node sends a multidrop message to all other nodes in one of the rows of the sending node, in a second phase each of the recipients of the first phase and the sender of the first phase simultaneously send a multidrop message to all other nodes in a row orthogonal to the row of the first phase, in a third phase each of the recipients of the second phase and the senders of the second phase simultaneously send a multidrop message to all other nodes in a row orthogonal to the rows of the first and second phases, and so on in further phases such that all nodes of the cube receive the broadcast message after all the phases.
2. The method of claim 1, including: providing a class value to implement multidrop message passing; providing each switch with a first table to determine if a copy of the message packet is to be deposited at a local node if said local node is an intermediate node, and a second table to determine if a copy of the message packet is to be deposited at a local node if said local node is a destination node.
3. The method of claim 1, including providing a class value determining a switch action of multidestination message passing of a message packet to multiple destination nodes in the network, wherein, for each incoming packet at a local node, said class value is used as an index into a table having entry values indicating if the packet should be routed using other information in the table entry.
4. The method of claim 1, wherein a table entry is made up of multiple bits for encoding all routing actions, one or more said bits corresponding to one or more outgoing links from said switch, said switch duplicating an incoming packet onto multiple outgoing links.
5. The method of claim 4, including providing a class routing table with different values on different incoming links.

6. The method of claim 4, wherein the message packet is multicast to an entire row, plane, cube or other collection of contiguous nodes on the network.
7. The method of claim 4, for a D-dimensional network, including providing 2*D class values for multicast in each of the 2*D directions to allow each node in the network to use 2*D multicasts to effectively broadcast a packet to all other nodes in the mesh.
8. The method of claim 1, including: performing dense matrix inversion algorithms on a network of distributed memory parallel computers with hardware class function multicast capability, wherein the hardware class function multicast capability simultaneously stores into memory a message packet that arrives and immediately sends the message packet to one or more other nodes while that message packet is being stored into memory, such that the communication patterns of the dense matrix inversion algorithms are served by the hardware class function multicast capability to minimize communication delays.
9. The method of claim 1, wherein the network comprises a network of distributed-memory parallel computers; providing each node of the computer network with one or more processors that operate on local memory; coordinating the actions of multiple nodes of the computer by using class routing to pass messages between the multiple nodes.
10. The method of claim 9, including: pairing each node with a switch of the network; connecting the switches to form a three dimensional torus wherein each switch is linked to six other switches, the links are coupled to a switch in a positive direction and also to a switch in a negative direction in each of the three dimensions; identifying each switch by an x, y, z logical address on the torus, wherein each node has the address of its switch; including a field value for the logical address in the packet header, to enable the packet to identify a destination node.
11. The method of claim 1, including providing a class value determining a switch action of a unicast of a message packet through the network to a single destination node.
12. The method of claim 1, including providing a class value enabling a node to perform snooping, wherein a node acquires and stores a copy of all packets passing through the node, including packets not destined for nor deposited at the node, to provide information on the performance of the network.
13. The method of claim 1, including providing class values to determine if a copy of the message packet is to go out on an X link or not, and out on a Y link or not, and out on a Z link or not, and so on for the other links of the D dimensions.
14. The method of claim 1, including providing different tables and providing priorities for different tables to enable a switch to decide between conflicting actions indicated by different tables.
15. The method of claim 1, including using class-based multicasting to create other classes, such that the contents of a table for a particular class value is determined by using another class value.
16. The method of claim 1, wherein said array of stored values at each switch includes several class values for multicasting such that said class network routing allows for several multicast groups, each with a different set of nodes.